-
-
Notifications
You must be signed in to change notification settings - Fork 437
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
switch sample_single int to O'neill's modulo method & update gen_rang… #1154
switch sample_single int to O'neill's modulo method & update gen_rang… #1154
Conversation
…e_high benchmarks doc shortcutting sample_single int method mention which methods
This breaks the value-stability tests, so they would need to be updated. The performance gains look good! How does it compare to just using |
Broken value stability means that we should release it only in |
@newpavlov Here is our policy. I don't quite understand what is meant by "portable item", but it sounds like this PR would be considered a breaking change. |
@vks
I think it was an attempt to introduce two classes of deterministic algorithms: value stable and not. For example, into the latter class probably should fall algorithms working with floats, since we can not provide bit-level value stability for them. Floats being floats and all. But it's not clear where exactly we define class for an algorithm (maybe we should introduce yet another marker trait?). |
It's not well worded I agree. Possibly "item" should be replaced with "method". But, yes, this would be considered a breaking change, and yes the test results need to be updated. @newpavlov see the last section on that page:
|
Compared to just name Uniform ns/iter gen_range ns/iter diff ns/iter diff % speedup
+i128_high 6,775,080 (236 MB/s) 3,057,767 (523 MB/s) -3,717,313 -54.87% x 2.22
+i128_low 5,291,480 (302 MB/s) 1,287,980 (1242 MB/s) -4,003,500 -75.66% x 4.11
+i16_high 691,000 (289 MB/s) 286,856 (697 MB/s) -404,144 -58.49% x 2.41
+i16_low 701,163 (285 MB/s) 262,461 (762 MB/s) -438,702 -62.57% x 2.67
+i32_high 1,748,010 (228 MB/s) 1,390,462 (287 MB/s) -357,548 -20.45% x 1.26
+i32_low 674,240 (593 MB/s) 227,620 (1757 MB/s) -446,620 -66.24% x 2.96
+i64_high 2,415,940 (331 MB/s) 1,478,100 (541 MB/s) -937,840 -38.82% x 1.63
+i64_low 1,245,925 (642 MB/s) 255,013 (3137 MB/s) -990,912 -79.53% x 4.89
+i8_high 703,913 (142 MB/s) 247,025 (404 MB/s) -456,888 -64.91% x 2.85
+i8_low 713,518 (140 MB/s) 244,256 (409 MB/s) -469,262 -65.77% x 2.92 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine to me. Thanks @TheIronBorn.
We should hold off on merging until we're preparing to release 0.9 (or 1.0 as the case may be).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Ignore]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This modifies sample_single_inclusive
but not sample
.
This is based on @TheIronBorn's work (#1154, #1172), with some changes.
I'm (still) working on #1196, which will replace this. I'd like to leave it open for now as a reference (though I think we won't end up using this code for any of the variants). |
This is based on @TheIronBorn's work (#1154, #1172), with some changes.
This is based on @TheIronBorn's work (#1154, #1172), with some changes.
This is based on @TheIronBorn's work (#1154, #1172), with some changes.
Fixes #1145 by using the modulo-widening-multiplication method while shortcutting modulo in many cases. Specifically the two shortcuts proposed by O'Neill here.
See this chart for 16-bit rejection/acceptance probabilities and modulo calculation chance:
At some ranges below one third the full size the rejection probabilities are equal and so the modulo may slow things down but the fewer rejections of other low ranges more than make up for it. Note that above one-third of the range size no modulo is ever performed and the rejection chances are equal. Note also that we use 32-bits internally for 8/16 bits so this chart isn't comparable for those.
This is around 2.5x faster than the previous method for lower ranges and about 0.92x slower for high ranges, discounting 8/16 bits. It seems the the extra branches slow things down slightly when using the same rejection chance. The 128-bit modulo also seems to slow things down a bit. For 8/16 bits the high ranges are not so affected due to the 32-bit internals.
Benchmarks:
I also changed the
gen_range_high
benchmarks to use a more sensible region of the integer, just above half where both rejection chances are highest. The old ones didn't seem to have any pattern to them. For 8/16 bits we could try to use ranges even closer to the 8/16 bit max for more rejections thanks to 32-bit internals.This breaks value stability of course. I wasn't sure if I should update those tests with new values.
And here's a Criterion benchmark as well: